Abstract


The overall aim of the current investigation was to develop and validate the initial version of the Minnesota Inference 
Assessment (MIA). MIA is a web-based measure of inference processes in K–2. MIA leverages the affordances of different
media to evaluate inference processes in a nonreading context, using age-appropriate fiction and nonfiction videos coupled 
with questioning. We evaluated MIA’s technical adequacy in a proof-of-concept study. Taken together, the results support 
the interpretation that MIA shows promise as a valid and reliable measure of inferencing in a nonreading context for 
students in Grades K–2. Future directions involve further development of multiple, parallel forms that can be used for
progress monitoring in K–2.
Theories of comprehension specify how comprehension
emerges during moment-by-moment processing and assume 
that it is based on a mental representation that is constructed 
over the course of reading (e.g., Kintsch, 1998). Most theo- 
ries also emphasize inferences as core processes because 
they are necessary for the construction of a coherent situa- 
tion model, which is the product of reading comprehension 
(McNamara & Magliano, 2009). Two types of inferences 
are necessary in this context (Graesser, 2015; McNamara, 
2004): Bridging inferences, which involve connections 
between ideas within the text, and elaborative inferences, 
which involve connections between information in the text 
and prior knowledge (Barth et al., 2015). 

To understand how inference processes break down for 
readers who fail to create a coherent mental representation of a 
text, it is essential to evaluate the construction of this represen- 
tation as it occurs moment-by-moment (Kendeou, 2015; 
McMaster et al., 2012). However, comprehension is typically 
assessed after reading a text, providing very little insight into 
the inferential processes during reading and why these pro- 
cesses may succeed or fail. As we elaborate next, there are 
readers who struggle with precisely these processes, but exist- 
ing measures that examine individual differences in compre- 
hension processes are limited in the information they provide. 


Existing Measures of Inference Processes


There are very few measures of inference processes in the 
context of reading comprehension. One such measure is
the Multiple-Choice Online Causal Comprehension
Assessment (MOCCA; Biancarosa et al., 2019; Carlson 
et al., 2014). MOCCA evaluates the processes by which 
students in Grades 3 to 5 generate causal inferences. For 
this measure, students read short texts in which one sen- 
tence is omitted. Students are presented three sentences 
and choose which one best fills in the omitted line from 
each text. The two incorrect responses represent processes 
on which poor comprehenders have been shown to rely; 
thus, patterns in response selections provide diagnostic 
information about the processes students engage in. 
Carlson et al. reported correlations between MOCCA
scores and several reading-related measures (e.g., DIBELS
Oral Reading Fluency [ORF]; Woodcock–Johnson [WJ]-III
word ID; Curriculum-Based Measurement [CBM] Maze)
ranging from r = -.37 to .75, and generally acceptable
internal consistency (i.e., α > .60). A second measure is
the Bridging Inferences Test (Bridge-IT; Barth et al., 2015). 
Bridge-IT was designed to evaluate inferential processes 
for students in Grades 6 to 12 using an inconsistency para- 
digm. Students read a set of sentences and judge whether a 
continuation sentence is consistent with previously read 
information. Accuracy and response times assess inferential 
processes. Barth et al. showed that the Bridge-IT accounted
for unique variance in reading comprehension (i.e., Gates-MacGinitie
Reading Test [GMRT]; MacGinitie &
MacGinitie, 1989), with scores also showing good internal
consistency (i.e., α > .80). Performance on both of these
measures, though, also depends on students’ decoding abil- 
ity and, thus, may not provide an accurate index of stu- 
dents’ inference skills. 

To remove reliance on decoding, measures that evaluate 
inference processes in nonreading contexts have also been 
developed. One such measure is the Learning and Reading 
Research Consortium (LARRC) Inference Task (LARRC, 
2015) for Grades PreK–4. The test consists of two stories at
each grade level, each followed by eight open-ended ques- 
tions that assess students’ ability to generate local and global 
coherence inferences. The stories and questions were based 
on the work of Cain and Oakhill (1999). LARRC and 
Muijselaar (2018) reported correlations between r = .37 and
.45 with measures of listening comprehension (Clinical
Evaluation of Language Fundamentals: Fourth Edition— 
Understanding Spoken Paragraphs [CELF-4-USP]; Semel 
et al., 2003) in K–3, and acceptable internal consistency (i.e.,
α > .64). Another measure is the Know-IT (Barnes et al., 1996)
for students aged 6 to 15 years. Students are first taught a series 
of facts about a fictional world called “Gan” and are later tested 
on those facts. Then 10 episodes/paragraphs of a text about Gan 
are read aloud to the student followed by four open-ended ques- 
tions (elaborative inference, bridging inference, literal question, 
and a simile question) that the student answers orally. Even 
though performance on both of these measures does not rely on 
decoding, they must be administered and manually scored in a 
time-intensive one-on-one context, limiting feasibility for classroom
use at scale. Also, both measures have only one form,
limiting their use for evaluating the effects of instruction or an
intervention (e.g., in a pre–post design).

Given the aforementioned limitations of existing mea- 
sures, new assessments of inferencing are needed that (a) 
gauge moment-by-moment inferential processes, (b) allow 
assessment without reliance on decoding, (c) show feasibil- 
ity for easy and efficient classroom administration, and (d) 
offer multiple equivalent forms to monitor student progress 
as well as evaluate the effects of instruction. To address 
these needs, we developed the Minnesota Inference 
Assessment (MIA), a web-based measure of inferencing for 
students in Grades K–2. MIA has a strong theoretical basis.
It draws on the Inferential Language Comprehension (iLC) 
framework (Kendeou et al., 2020), which was proposed to 
guide the use of visual narratives to teach and assess infer- 
encing skills in educational settings. 


The iLC Framework 


The iLC framework (Kendeou et al., 2020) proposes that a 
general inferencing skill underlies successful language
comprehension and can transfer across contexts and media.
Across different media, information that supports the con- 
struction of a coherent mental representation can be reacti- 
vated using targeted questioning (Graesser & Franklin, 
1990). This information can be further integrated through 
the use of scaffolding and specific feedback. There is 
empirical evidence for inferencing as a general skill that 
transfers across media (e.g., Kendeou, 2015). For example, 
children aged 4 through 8 years generated both bridging 
and elaborative inferences when asked questions after 
aural, televised, and written stories (Kendeou et al., 2008). 
Regardless of media, the inferences that these children 
generated predicted their overall reading comprehension 
performance longitudinally. One explanation for transfer 
of inferencing skill is that learners engage in the same cog- 
nitive processes to construct a mental representation of the 
information across media (Kintsch, 1998; Magliano et al., 
2007). Another explanation is that the same underlying text 
factors (e.g., causal connections, explicated goals, event 
boundaries) predict comprehension of aural, televised, and
written stories (e.g., Magliano et al., 2012).

In short, the iLC framework proposes that inferencing is 
a general skill that transfers across media and can be 
assessed in both reading and nonreading contexts. 
Consistent with this idea, MIA leverages the processing 
similarities between reading and nonreading contexts to 
measure inferencing using videos. 


The Present Study 


The overall goal of the current investigation was to 
develop and validate the initial version of MIA, a web- 
based measure of inference processes that does not rely on 
decoding, for students in K–2. Evidence for technical
quality focused specifically on validity, reliability/preci- 
sion, and intended use of scores (AERA et al., 2014). With 
respect to validity, we examined the extent to which the 
difficulty level of MIA items aligned with students’ ability 
levels, thereby indicating that MIA is suitable for its target 
population. We also examined the extent to which a unidi- 
mensional structure fits the data using the Rasch Model 
(Rasch, 1960) and confirmatory factor analysis (CFA). In 
addition, we examined criterion-related validity based on 
correlations with general measures of language and read- 
ing comprehension that involved some level of inferenc- 
ing. With respect to reliability/precision, we examined the 
internal consistency of MIA using multiple indices (i.e., 
internal consistency coefficients and person separation 
index from the Rasch Model). Finally, with respect to the 
intended use of scores, we explored the adequacy of four 
parallel, equivalent forms from MIA (created using auto- 
mated test assembly [ATA] procedures) in evaluating the 
effects of instruction. 
Method


Participants 


The current dataset was drawn from a proof-of-concept 
study examining the efficacy of an inferencing intervention 
for students in Grades K–2. The current sample included
n = 272 students: Kindergarten (n = 146, 60 female), Grade
1 (n = 102, 52 female), and Grade 2 (n = 24, 11 female) 
from a Midwestern school district. The sample was racially 
and ethnically diverse—Kindergarten: 38.4% White, 37% 
Hispanic, 13% African American, 7.5% Asian/Pacific 
Islander, and 2.1% American Indian; Grade 1: 15.7% White, 
46.1% Hispanic, 23.5% African American, 12.7% Asian/ 
Pacific Islander, and 2% American Indian; Grade 2: 8.3% 
White, 62.5% Hispanic, and 29.2% African American. The 
sample was also economically diverse: 59% of Kindergarten 
students, 74.5% of Grade 1 students, and 87% of Grade 2 
students qualified for free or reduced-price lunch.


Measures 


MIA. MIA consists of four modules. Each module includes 
a 5-min video (one fiction, one nonfiction) and 16 inferen- 
tial multiple-choice questions, each with one correct answer 
and three meaningful distractors (i.e., incorrect answer 
choices). Fiction videos were adapted from Blinky Bill car- 
toon episodes (Apple Thieves and Granny’s Glasses). Non- 
fiction videos were adapted from animal documentaries 
(Cephalopods and Eagles). Students can complete each 
module in approximately 20 min. Questions are adminis- 
tered aurally via an animated pedagogical agent as shown in 
Figure S1 (see Online Supplemental materials). The ques- 
tions interrupt the video to prompt inferences at the point in 
the video when those inferences are necessary for compre- 
hension (i.e., online inferencing). The current version 
includes two types of inferential questions that are required 
for comprehension, bridging and elaborative. The process 
for video selection, editing, and question writing is described 
in Figure S2 (see Online Supplemental materials).


Language comprehension. The CELF Fifth Edition—Under- 
standing Spoken Paragraphs (CELF-5-USP) subtest (Wiig 
et al., 2003) was used as a criterion measure. The CELF-5- 
USP is an individually administered, nationally normed 
assessment. Participants listen to brief passages and answer 
questions that target several dimensions of language compre- 
hension (i.e., understanding of the main idea, memory of facts 
and details, recall of event sequences, and making inferences 
and predictions). Answers to questions are recorded and 
scored as correct or incorrect according to response norms 
provided by test developers. Seven raters independently dou- 
ble-scored a third of the audio-recordings. The overall inter- 
rater reliability was 93.6% for the 5 to 6 age form and 93.5% 
for the 7 to 8 age form. Discrepancies were resolved through
discussion. Internal consistency was α = .81 for the 5 to 6 age
form and α = .87 for the 7 to 8 age form. Students’ scale
scores were used in the analyses. 


Reading comprehension. GMRT-IV (MacGinitie & MacGinitie,
1989) was used as a criterion measure in Grades 1 and
2. The GMRT consists of 10 passages. For Grade 1, there 
are seven narrative and three expository passages. For 
Grade 2, there are six narrative and four expository pas- 
sages. Each passage is divided into short segments. Each 
segment is accompanied by a multiple-choice item consist- 
ing of three images. Students select the image that matches 
the content of the story. GMRT is group administered and 
limited to 35 min. Reliability for scores was KR-20 = .88
for Level 1 and KR-20 = .92 for Level 2. The Adaptive
Reading measure (aReading; Christ et al., 2014; α > .91
across Grades K–5) was used as a criterion measure in Kindergarten.
aReading is a computer adaptive measure that
evaluates concepts of print, phonological awareness, and 
decoding. Students’ item response theory (IRT)-derived 
theta scores were used in the analyses. 


Procedure 


All students completed MIA and CELF-5-USP before and 
after an 8-week supplemental instructional program 
designed to train inference making in kindergarten (project 
ELCII) and Grades 1 and 2 (project TELCI). Students in 
kindergarten completed aReading, whereas students in 
Grades 1 and 2 completed the GMRT. All measures were 
administered in a quiet space by trained research staff. 


Data Analysis and Results 
Validity Evidence 


First, we examined the item difficulty and discrimination of 
MIA items using classical test theory (CTT) to identify prob- 
lematic items and select a final set of items that captures a 
wide range of abilities. Item difficulty is the proportion of
students who respond to the item correctly, so lower values
indicate more difficult items. Item
discrimination is the correlation between responses for an 
item and the total score, which indicates whether the item 
measures the same construct as the other items. A high-qual- 
ity item should have a moderate to high correlation (>.20) 
with the total score (Everitt & Skrondal, 2010). As can be 
seen in Table S1 (see Online Supplemental materials), item 
difficulty values covered a wide range, though some items 
had low values, indicating that these items were generally 
difficult for most students. Particularly, items in Cephalopods 
were the most difficult. Items in Granny's Glasses and 
Eagles had similar difficulty. Items in Apple Thieves 
appeared to cover the widest range in difficulty. 
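
The two classical test theory indices described above can be sketched in a few lines. This is only an illustration under stated assumptions: the response matrix is hypothetical toy data (not MIA data), the function names are ours, and discrimination is computed as the uncorrected item-total correlation, which may differ from the exact variant used in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_analysis(responses):
    """responses: one list of 0/1 item scores per student.
    Returns one (difficulty, discrimination) pair per item."""
    totals = [sum(row) for row in responses]
    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        difficulty = sum(item) / len(item)      # proportion correct
        discrimination = pearson(item, totals)  # item-total correlation
        results.append((difficulty, discrimination))
    return results

# Hypothetical toy data: 6 students x 3 items (not MIA data).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
stats = item_analysis(toy)

# Items at or below the .20 discrimination threshold would be flagged for review.
flagged = [j for j, (_, disc) in enumerate(stats) if disc <= 0.20]
```

An item answered correctly (or incorrectly) by every student has zero variance, so its item-total correlation is undefined; a fuller implementation would need to handle that case explicitly.
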
Second, we used the Rasch model (Rasch, 1960) to calibrate
MIA items and calculate item difficulty parameters and
student abilities within the IRT framework using Winsteps 
(Linacre, 2019). Regarding item calibration results from the 
Rasch model, Table S2 (see Online Supplemental materials) 
shows the estimated item parameters and their standard 
errors. The Rasch model places both items and students onto 
the same logit scale. Figure S3 (see Online Supplemental
materials) shows that the item difficulty values ranged from
-2 to +2, and students’ ability levels ranged from -3 to +2.
Thus, students’ ability levels aligned with the difficulty levels
of the items. Figure S4 (see Online Supplemental materials)
shows that MIA was highly informative (i.e., precise in
terms of measurement) within the -2 to +2 range of student
ability, and the conditional standard error of measurement
(cSEM) was quite low within this range. In addition,
the item set discriminates well within this range.
Overall, these findings suggest that the difficulty level of 
MIA was suitable for this sample of students in K–2.

Third, we evaluated construct validity evidence. The 
results of the Rasch model indicated that a unidimensional 
vertical scale indeed had a good fit to the data and explained 
40% of the total variance. We also used CFA to confirm the 
unidimensional structure. The CFA results indicated that the 
assessment had acceptable levels of model-data fit (comparative
fit index [CFI] = .90, Tucker–Lewis index [TLI] =
.90, root mean square error of approximation [RMSEA] =
.023). These findings suggest that there is adequate evidence
supporting the unidimensionality of MIA.

Finally, we examined criterion-related validity evidence. 
To do so, we computed bivariate correlations among MIA 
and measures of language comprehension and reading. 
Consistent with other nonreading inference-focused mea- 
sures (e.g., LARRC Inference Task), we expected weak to 
moderate positive correlations. Indeed, correlations between 
MIA and CELF-5-USP ranged from r = .322 to .403 (p < 
.01) in K-2; MIA and GMRT from r = .326 to .339 (p < 
.01) in Grades 1 and 2; and MIA and aReading from r = 
.216 to .385 (p < .05) in K. 
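
As a minimal illustration of the criterion-related analysis, the sketch below computes a bivariate correlation and the t statistic used to test it against zero. It assumes the reported coefficients are Pearson's r; the scores are hypothetical and the variable names are ours.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r * r))

# Hypothetical scores for eight students (not study data).
mia_scores = [5, 7, 8, 10, 11, 12, 14, 15]
celf_scores = [6, 9, 7, 10, 8, 12, 11, 13]

r = pearson_r(mia_scores, celf_scores)
t = t_statistic(r, len(mia_scores))
```

The p value follows from comparing t with a t distribution on n - 2 degrees of freedom, which is omitted here to keep the sketch free of external libraries.
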


Reliability/Precision Evidence 


We examined the reliability and precision of MIA scores 
based on a variety of indices. Particularly, the coefficient 
alpha value was .88, suggesting that the assessment had 
good internal consistency. Alternative reliability indices 
were also evaluated: Guttman L2 = .88, Feldt-Brennan = 
.88, and Feldt-Gilmer = .88. The person separation index
from the Rasch model was .85, which also suggests that 
MIA produces reliable scores. 
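
Coefficient alpha, the first index reported above, can be computed directly from a scored response matrix. The sketch below uses the standard formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), with hypothetical toy data; the other indices (Guttman L2, Feldt-Brennan, Feldt-Gilmer, person separation) are not covered.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def coefficient_alpha(responses):
    """responses: one list of item scores per student."""
    k = len(responses[0])
    item_vars = [sample_variance([row[j] for row in responses]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical toy data: 6 students x 3 items (not MIA data).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
alpha = coefficient_alpha(toy)
```

With only three items the toy value is low; alpha generally rises as more items measuring the same construct are added.
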


Intended Use of Scores 


We evaluated the adequacy of four parallel, equivalent forms 
with fewer items, to provide evidence for instructional 


sensitivity. To create the forms, we implemented an ATA 
procedure, which uses computer algorithms and mathemati- 
cal optimization techniques to construct parallel test forms 
that satisfy a set of psychometric, content, and test adminis- 
tration constraints. For MIA, the parallel test forms were 
developed based on the following constraints: (a) each form 
will consist of 16 items, (b) each item can be used only once 
across the forms, (c) each form will include items from each 
module (Apple Thieves or Granny's Glasses; Cephalopods 
or Eagles), and (d) each form will maximize the test infor- 
mation (i.e., measurement precision) within the ability range 
of -2 to +2. ATA was implemented with the xxIRT package
(Luo, 2019) in R (R Core Team, 2019). The ATA procedure 
was able to solve the optimization problem and yielded an 
optimal solution based on the specified constraints. See 
Table S3 (see Online Supplemental materials) for a sum- 
mary of the four test forms. The forms yielded very similar 
levels of internal consistency (Form 1: α = .87; Form 2: α = .87;
Form 3: α = .90; Form 4: α = .90).
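
The idea behind assembling parallel forms can be illustrated with a much simpler heuristic than the one used in the study. The sketch below is not the mixed-integer optimization performed by an ATA solver and it ignores the module constraint in (c); it uses a serpentine assignment, in which items are sorted by difficulty and dealt to forms in forward-then-backward order so that every item is used once and each form spans the full difficulty range. The item pool is hypothetical (64 difficulties evenly spaced over -2 to +2 logits), and all function names are ours.

```python
from math import exp

def rasch_info(theta, b):
    """Rasch item information p * (1 - p) at ability theta."""
    p = 1.0 / (1.0 + exp(-(theta - b)))
    return p * (1.0 - p)

def assemble_forms(difficulties, n_forms):
    """Serpentine assignment of item indices to n_forms forms."""
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    forms = [[] for _ in range(n_forms)]
    for rank, item in enumerate(order):
        block, pos = divmod(rank, n_forms)
        # Reverse the dealing direction on every other block.
        form = pos if block % 2 == 0 else n_forms - 1 - pos
        forms[form].append(item)
    return forms

def form_information(form, difficulties, thetas):
    """Test information of one form at each ability level in thetas."""
    return [sum(rasch_info(t, difficulties[i]) for i in form) for t in thetas]

# Hypothetical pool: 64 item difficulties evenly spaced from -2 to +2 logits.
pool = [-2.0 + 4.0 * k / 63 for k in range(64)]
forms = assemble_forms(pool, n_forms=4)

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
infos = [form_information(form, pool, thetas) for form in forms]
for number, info in enumerate(infos, start=1):
    print(f"Form {number}:", [round(value, 2) for value in info])
```

Comparing the printed information values across forms gives a quick check of how parallel the forms are within the targeted ability range; an optimization-based ATA procedure enforces this, along with content constraints, directly.
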

Next, we used the scores from these forms to evaluate 
instructional sensitivity. Students in K–2 received inference
instruction for 8 weeks, and scores on these forms were 
used to evaluate sensitivity to instruction from pretest to 
posttest. There was no control group in this proof-of-concept
study. Rather, the focus was to evaluate whether MIA
would be sensitive enough to monitor and evaluate progress during
inference-focused instruction. Effect sizes varied from
medium (Form 1: std. beta = 0.62, std. SE = 0.08, η² = .15;
Form 3: std. beta = 0.63, std. SE = 0.08, η² = .20) to small
(Form 2: std. beta = 0.46, std. SE = 0.08, η² = .13; Form
4: std. beta = 0.38, std. SE = 0.08, η² = .09), showing
promise for MIA to be used for progress monitoring during
inference-focused instruction.


Discussion and Implications 


The overall aim of the current investigation was to develop 
and validate the initial version of MIA. MIA is a web-based 
measure of inference processes in K–2. MIA leverages the
affordances of different media to evaluate inference pro- 
cesses in a nonreading context using age-appropriate fiction 
and nonfiction videos coupled with questioning. We evalu- 
ated MIA’s technical adequacy in a proof-of-concept study. 

Taken together, the results support the interpretation that 
MIA shows promise as a valid and reliable measure of 
inferencing in a nonreading context for students in K-2. 
MIA has a strong theoretical basis, conceptualizing infer- 
ence processes as unidimensional and independent of media 
factors (Kendeou et al., 2020). The validity argument for 
the current version of MIA shows that the underlying con- 
struct of inference processes assessed is unidimensional, 
has moderate correlations with measures of language and
reading comprehension, and has a difficulty level suitable
for students in K–2. These findings are consistent with
those obtained for other nonreading inference measures 
(LARRC & Muijselaar, 2018). Reliability/precision evidence
for MIA and its parallel forms was also adequate.

Despite the considerable strengths of MIA, this initial 
version has several limitations that need to be addressed 
with further development and refinement. First, the current 
four parallel forms use some of the same videos. Watching 
the same video twice may influence performance due to 
familiarity. Thus, we need to increase the pool of videos 
(and questions) to create unique, parallel forms suitable for 
progress-monitoring purposes. Second, although the cur- 
rent sample was ethnically and economically diverse, stu- 
dents from only one school district in the Midwestern 
United States were included with a small number of stu- 
dents in Grade 2, thereby limiting the generalizability of the 
current findings. Future work should include large and 
diverse samples in K–2 across the United States; because
English Learner status was not available in the current sam- 
ple, it is also important to evaluate the appropriateness of 
MIA for English Learners in future studies. Finally, prog- 
ress monitoring and instructional utility of MIA scores were 
only evaluated at two time points, pretest and posttest, in 
the context of an 8-week supplemental instruction on infer- 
ence processes. In future studies, we need to evaluate the 
use of the parallel forms at multiple time points and in rela- 
tion to a control group to establish technical adequacy for 
both progress monitoring and efficacy of instruction. 

We contend that further development and refinement of 
MIA has the potential to produce a scalable, web-based assess- 
ment with fully automated test administrations and score 
reports that can help researchers and teachers evaluate infer- 
ence processes independent of decoding in the early years. 